Building a 50M Corpus of Tajik Language
Abstract
This paper presents by far the largest available computer corpus of the Tajik language, comprising more than 50 million words. Two different approaches were used to obtain the texts for the corpus. The paper describes both approaches, discusses their advantages and disadvantages, and gives some statistics for the two respective partial corpora. It then characterizes the resulting combined corpus and finally discusses some possible future improvements.
Similar resources
Towards 100M Morphologically Annotated Corpus of Tajik
The paper presents work in progress: building a morphologically annotated corpus of the Tajik language of more than 100 million tokens. The corpus is and will be by far the largest available computer corpus of Tajik: even its current size is almost 85 million tokens. Because the available text sources are rather scarce, to achieve the goal also the texts of a lower quality have to be inclu...
Efficient Web Crawling for Large Text Corpora
Many researchers use texts from the web, an easy source of linguistic data in a great variety of languages. Building both large and good quality text corpora is the challenge we face nowadays. In this paper we describe how to deal with inefficient data downloading and how to focus crawling on text rich web domains. The idea has been successfully implemented in SpiderLing. We present efficiency ...
Metadiscourse Markers in a Corpus of Learner Language: The Case of Iranian EFL Learners
Different issues have been probed in learner corpus research since the late 1980s. However, given the importance of metadiscourse markers (MDMs) in signposting academic discourse, their use in Iranian EFL learners' academic essays is an area of research in need of a more serious analysis. Contributing to this line of investigation, this paper reports a corpus-based study of the use of MDMs i...
Language and the Socio-Cultural Worlds of Those Who Use it: A Case of Vague Expressions
The present study is an attempt to investigate the use of vague expressions by intermediate EFL learners. More specifically, the current study focuses on the structures and functions of one of the most common categories of vague language, i.e. general extenders. The data include a 22-hour corpus of English-as-a-foreign-language conversations. A comparison is also made between this corpus and a...
Corpus based coreference resolution for Farsi text
"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...
Publication date: 2011